28 research outputs found

Interlingua based neural machine translation

Author: Escolano Peinado Carlos
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2018

We propose a machine translation architecture based on autoencoders and a shared interlingua representation that produces results comparable to state-of-the-art systems. We also define evaluation and visualization strategies to measure the performance of the architecture.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Learning multilingual and multimodal representations with language-specific encoders and decoders for machine translation

Author: Escolano Peinado Carlos
Publication venue: Universitat Politècnica de Catalunya
Publication date: 18/03/2022

This thesis studies language-specific approaches to Multilingual Machine Translation that do not share parameters, and compares their properties with the current parameter-sharing state of the art. We define Multilingual Machine Translation as the task of translating between several language pairs in a single system. It has been widely studied in recent years because it scales easily to more languages, even to pairs never seen together during training (zero-shot translation). Several architectures have been proposed to tackle this problem, with varying amounts of parameters shared between languages. Current state-of-the-art systems use a single sequence-to-sequence architecture in which all languages share the complete set of parameters, including the token representation. While this has proven convenient for transfer learning, it makes it hard to incorporate new languages into the trained model, because all languages depend on the same parameters. What all proposed architectures have in common is that they enforce a shared representation space between languages. In this work, that representation is the final output of the encoders, which the decoders use to perform cross-attention. A shared space reduces noise, because semantically similar sentences produce similar vector representations, which helps the decoders process representations from several languages. This semantic representation is particularly important for zero-shot translation, where similarity to the representations of language pairs seen during training is key to reducing ambiguity between languages and obtaining good translation performance.
The thesis is structured in three main blocks, each focused on a different scenario of this task. First, we propose a training method that enforces a common representation for bilingual training, and a procedure to extend it to new languages efficiently. Second, we propose another training method that allows this representation to be learned directly on multilingual data and that can likewise be extended to new languages. Third, we show that the proposed multilingual architecture is not limited to textual languages: we extend our method to new data modalities by adding speech encoders, performing Spoken Language Translation, including zero-shot, to all the supported languages. Our main results show that a common intermediate representation is achievable without sharing parameters between languages, matching the performance of existing shared systems while allowing new languages or data modalities to be added efficiently, without negative transfer to the previous languages and without retraining the system. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Integración de conocimiento morfológico en un sistema de traducción estadístico chino-castellano

Author: Escolano Peinado Carlos
Publication venue: Universitat Politècnica de Catalunya
Publication date: 29/06/2016

The aim of this project is to start from a Chinese-to-simplified-Spanish translation, from which the morphological information has been removed, and to develop an architecture that recovers this information and generates a full translation while keeping the benefits of translating into the simplified form.

UPCommons. Portal del coneixement obert de la UPC

Chinese-Catalan: A neural machine translation approach based on pivoting and attention mechanisms

Author: Casas Manzanares Noé
Escolano Peinado Carlos
Rodríguez Fonollosa José Adrián
Ruiz Costa-Jussà Marta
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019

This article addresses machine translation from Chinese to Catalan using neural pivot strategies trained without any direct parallel data. Catalan is linguistically very similar to Spanish, which motivates the use of Spanish as the pivot language. For the neural architecture, we use the Transformer, the current state-of-the-art model, which is based solely on attention mechanisms. Additionally, this work provides new resources to the community: a human-developed gold standard of 4,000 sentences between Catalan and Chinese and all the other United Nations official languages (Arabic, English, French, Russian, and Spanish). Results show that the standard pseudo-corpus (synthetic) pivot approach performs better than the cascade approach. Peer Reviewed. Postprint (author's final draft).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC
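To make the two pivot strategies named in the abstract above concrete, here is a minimal Python sketch. It assumes the usual formulation of both: the cascade chains a Chinese-to-Spanish system and a Spanish-to-Catalan system at inference time, while the pseudo-corpus approach translates the Spanish side of a Chinese-Spanish corpus into Catalan to obtain synthetic Chinese-Catalan training pairs. The lookup tables and the names `zh_to_es`, `es_to_ca`, `cascade`, and `build_pseudo_corpus` are hypothetical stand-ins for trained models, not the authors' code.

```python
# Toy lookup tables standing in for trained translation models (assumption).
ZH_ES = {"你好": "hola", "谢谢": "gracias"}
ES_CA = {"hola": "hola", "gracias": "gràcies"}


def zh_to_es(sentence):
    """Stand-in for a Chinese-to-Spanish system."""
    return ZH_ES[sentence]


def es_to_ca(sentence):
    """Stand-in for a Spanish-to-Catalan system."""
    return ES_CA[sentence]


def cascade(source, first, second):
    """Cascade pivoting: run two systems in sequence at inference time."""
    return second(first(source))


def build_pseudo_corpus(source_pivot_pairs, pivot_to_target):
    """Pseudo-corpus pivoting: translate the pivot side of a source-pivot
    corpus into the target language, giving synthetic source-target pairs
    on which a direct model could then be trained."""
    return [(src, pivot_to_target(piv)) for src, piv in source_pivot_pairs]


zh_es_corpus = list(ZH_ES.items())
pseudo_corpus = build_pseudo_corpus(zh_es_corpus, es_to_ca)

# A dictionary stands in for the direct zh->ca model trained on pseudo_corpus.
direct_zh_ca = dict(pseudo_corpus)

print(cascade("谢谢", zh_to_es, es_to_ca))
print(direct_zh_ca["谢谢"])
```

In the cascade, both systems run for every sentence at inference time. In the pseudo-corpus setting, the pivot system is used once to build training data, and a single direct model handles inference.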

Enriching the transformer with linguistic factors for low-resource machine translation

Author: Armengol Estapé Jordi
Escolano Peinado Carlos
Ruiz Costa-Jussà Marta
Publication venue: 'Assoc. for Computational Linguistics Bulgaria'
Publication date: 01/01/2021

Introducing factors, that is, word features such as linguistic information about the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it can incorporate external knowledge. In particular, our proposed modification, the Factored Transformer, uses linguistic factors that insert additional knowledge into the machine translation system. Apart from using different kinds of features, we study the effect of different architectural configurations. Specifically, we analyze the performance of combining words and features at the embedding level or at the encoder level, and we experiment with two different combination strategies. With the best configuration found, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which includes both extremely low-resourced and very distant languages, and obtain an improvement of 1.2 BLEU. This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 947657). Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC
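As an illustration of what combining words and features at the embedding level can look like, the following Python sketch merges a word embedding with a factor embedding in two ways. The abstract above does not name the two combination strategies, so summation and concatenation are assumptions here, as are the part-of-speech factors, the toy vectors, and the function names `combine_sum` and `combine_concat`.

```python
# Toy embedding tables; the values are arbitrary and only illustrate shapes.
WORD_EMB = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.5, 0.6, 0.7, 0.8],
    "sleeps": [0.9, 1.0, 1.1, 1.2],
}
# Factor table for summation: must match the word embedding dimension.
POS_EMB_SUM = {
    "DET": [1.0, 0.0, 0.0, 0.0],
    "NOUN": [0.0, 1.0, 0.0, 0.0],
    "VERB": [0.0, 0.0, 1.0, 0.0],
}
# Factor table for concatenation: may use a smaller dimension.
POS_EMB_CAT = {"DET": [1.0, 0.0], "NOUN": [0.0, 1.0], "VERB": [1.0, 1.0]}


def combine_sum(word_vec, factor_vec):
    """Element-wise sum; both vectors need the same dimension."""
    if len(word_vec) != len(factor_vec):
        raise ValueError("summation requires equal dimensions")
    return [w + f for w, f in zip(word_vec, factor_vec)]


def combine_concat(word_vec, factor_vec):
    """Concatenation; the output dimension is the sum of both dimensions."""
    return list(word_vec) + list(factor_vec)


# Each source token is paired with its (assumed) part-of-speech factor.
sentence = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]

summed = [combine_sum(WORD_EMB[w], POS_EMB_SUM[p]) for w, p in sentence]
concatenated = [combine_concat(WORD_EMB[w], POS_EMB_CAT[p]) for w, p in sentence]

print(len(summed[0]), len(concatenated[0]))
```

Summation keeps the model dimension unchanged but forces the factor embeddings to match it. Concatenation lets the factor embedding be smaller, at the cost of a wider input to the encoder.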

Byte-based neural machine translation

Author: Escolano Peinado Carlos
Rodríguez Fonollosa José Adrián
Ruiz Costa-Jussà Marta
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2017

This paper presents experiments comparing character-based and byte-based neural machine translation systems. The main motivation of the byte-based neural machine translation system is to build multilingual neural machine translation systems that can share the same vocabulary. We compare the performance of both systems in several language pairs and find that test performance is similar for most language pairs, while training time is slightly reduced for byte-based neural machine translation. Postprint (author's final draft).

UPCommons. Portal del coneixement obert de la UPC
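The vocabulary-sharing motivation in the abstract above can be illustrated with a short Python sketch that segments the same sentences into characters and into UTF-8 bytes. The use of UTF-8 and the helper names `char_tokens`, `byte_tokens`, and `bytes_to_text` are assumptions made for illustration; the abstract does not specify the encoding or preprocessing used in the paper.

```python
def char_tokens(text):
    """Character-level segmentation: one token per Unicode character."""
    return list(text)


def byte_tokens(text):
    """Byte-level segmentation: one token per UTF-8 byte (values 0..255)."""
    return list(text.encode("utf-8"))


def bytes_to_text(ids):
    """Invert byte_tokens by decoding the byte sequence back to text."""
    return bytes(ids).decode("utf-8")


samples = {"en": "house", "es": "niño", "zh": "机器翻译"}

# The character vocabulary grows with every new script that is added.
char_vocab = set()
for text in samples.values():
    char_vocab.update(char_tokens(text))

for lang, text in samples.items():
    ids = byte_tokens(text)
    # Byte segmentation is lossless: decoding recovers the original text.
    assert bytes_to_text(ids) == text
    print(lang, len(char_tokens(text)), "characters,", len(ids), "bytes")

print("distinct characters seen:", len(char_vocab))
print("byte vocabulary size: 256, whatever the language")
```

The trade-off visible here is that a byte vocabulary has a fixed size shared by every language, while sequences in scripts outside ASCII become longer than their character-level counterparts.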

Multilingual machine translation: Deep analysis of language-specific encoder-decoders

Author: Escolano Peinado Carlos
Rodríguez Fonollosa José Adrián
Ruiz Costa-Jussà Marta
Publication venue: 'AI Access Foundation'
Publication date: 25/04/2022

State-of-the-art multilingual machine translation relies on a shared encoder-decoder. In this paper, we propose an alternative approach based on language-specific encoder-decoders, which can be easily extended to new languages by learning their corresponding modules. To establish a common interlingua representation, we simultaneously train N initial languages. Our experiments show that the proposed approach improves over the shared encoder-decoder for the initial languages and when adding new languages, without the need to retrain the remaining modules. All in all, our work closes the gap between shared and language-specific encoder-decoders, advancing toward modular multilingual machine translation systems that can be flexibly extended in lifelong learning settings. This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 947657). Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC
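The modularity described in the abstract above can be sketched as a registry with one encoder and one decoder per language. This is only a structural illustration: the class `ModularMT`, its methods, and the placeholder strings standing in for neural modules are hypothetical, and no training is modelled.

```python
class ModularMT:
    """Registry of language-specific encoders and decoders (toy sketch)."""

    def __init__(self):
        self.encoders = {}
        self.decoders = {}

    def add_language(self, lang):
        # Only the new language's modules are created; existing entries
        # are left untouched, mirroring "no retraining of other modules".
        self.encoders[lang] = f"encoder[{lang}]"
        self.decoders[lang] = f"decoder[{lang}]"

    def directions(self):
        """Every source-target pair reachable through the shared space."""
        return [(s, t) for s in self.encoders for t in self.decoders if s != t]

    def route(self, src, tgt):
        """Pick the source encoder and target decoder for one direction."""
        return self.encoders[src], self.decoders[tgt]


system = ModularMT()
for lang in ("en", "de", "fr"):
    system.add_language(lang)

before = dict(system.encoders), dict(system.decoders)
directions_before = len(system.directions())

# Adding a fourth language adds two modules and 2 * 3 = 6 new directions.
system.add_language("ru")
directions_after = len(system.directions())

print(directions_before, "->", directions_after)
print(system.route("ru", "fr"))
```

With N languages the registry holds 2N modules and covers N(N-1) directions, so each added language contributes two modules and 2N new directions, provided every module reads from or writes to the same intermediate representation.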

The TALP–UPC Spanish–English WMT biomedical task: bilingual embeddings and char-based neural language model rescoring in a phrase-based system

Author: Escolano Peinado Carlos
España-i-Bonet Cristina
Madhyastha Pranava
Rodríguez Fonollosa José Adrián
Ruiz Costa-Jussà Marta
Publication date: 01/01/2016

This paper describes the TALP–UPC system in the Spanish–English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules progressively improve the final translation as measured by a combination of several lexical metrics. Postprint (published version).

Crossref

UPCommons. Portal del coneixement obert de la UPC

From bilingual to multilingual neural machine translation by incremental training

Author: Escolano Peinado Carlos
Rodríguez Fonollosa José Adrián
Ruiz Costa-Jussà Marta
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2019

Multilingual Neural Machine Translation approaches are based on task-specific models, and adding one more language can only be done by retraining the whole system. In this work, we propose a new training schedule, based on joint training and language-independent encoder/decoder modules, that allows the system to scale to more languages without modifying the previous components and that allows for zero-shot translation. This work in progress shows results close to the state of the art in the WMT task. This work is supported in part by a Google Faculty Research Award. It is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER,EU) and contract PCIN-2017-079 (AEI/MINECO). Peer Reviewed. Postprint (published version).

Crossref

UPCommons. Portal del coneixement obert de la UPC

The TALP-UPC participation in WMT21 news translation task: an mBART-based NMT approach

Author: Basta Christine Raouf Saad
Escolano Peinado Carlos
Ferrando Monsonís Javier
Rodríguez Fonollosa José Adrián
Ruiz Costa-Jussà Marta
Tsiamas Ioannis
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2021

This paper describes the submission to the WMT 2021 news translation shared task by the UPC Machine Translation group. The goal of the task is to translate German to French (De-Fr) and French to German (Fr-De). Our submission focuses on fine-tuning a pre-trained model to take advantage of monolingual data. We fine-tune mBART50 using the filtered data, and additionally, we train a Transformer model on the same data from scratch. In our experiments, fine-tuning mBART50 reaches 31.69 BLEU for De-Fr and 23.63 BLEU for Fr-De, improvements of 2.71 and 1.90 BLEU, respectively, over the model we train from scratch. Our final submission is an ensemble of these two models, which adds a further 0.3 BLEU for Fr-De. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC